{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "40fb2b99-f188-4634-a11e-672e65752afa",
   "metadata": {},
   "source": [
    "# LangSmith Evaluation 快速入门\n",
    "\n",
    "概况来说，评估（Evaluation）过程分为以下步骤：\n",
    "\n",
    "- 定义 LLM 应用或目标任务(Target Task)。\n",
    "- 创建或选择一个数据集来评估 LLM 应用。您的评估标准可能需要数据集中的预期输出。\n",
    "- 配置评估器（Evaluator）对 LLM 应用的输出进行打分（通常与预期输出/数据标注进行比较）。\n",
    "- 运行评估并查看结果。\n",
    "\n",
    "本教程展示一个非常简单的 LLM 应用（分类器）的评估流程，该应用会将输入数据标记为“有毒（Toxic）”或“无毒（Not Toxic）”。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4913e104-82e6-4932-8e80-2b8bd57553c3",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip show langsmith"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9c8c8225-42bd-4c9b-adeb-62c83f80c9d3",
   "metadata": {},
   "source": [
    "## 1.定义目标任务\n",
    "\n",
    "我们定义了一个简单的评估目标，包括一个LLM Pipeline（将文本分类为有毒或无毒），并启用跟踪（Tracing）以捕获管道中每个步骤的输入和输出。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "4cb8b089-8d3c-4f56-b5d3-2929dcb49c26",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langsmith import traceable, wrappers\n",
    "from openai import Client\n",
    "\n",
    "# 包装 OpenAI 客户端\n",
    "openai = wrappers.wrap_openai(Client())\n",
    "\n",
    "# 标记函数可追踪\n",
    "@traceable\n",
    "def label_text(text):\n",
    "    # 创建消息列表，包含系统消息和用户消息\n",
    "    messages = [\n",
    "        {\n",
    "            \"role\": \"system\",\n",
    "            \"content\": \"请查看下面的用户查询，判断其中是否包含任何形式的有害行为，例如侮辱、威胁或高度负面的评论。如果有，请回复'Toxic'，如果没有，请回复'Not toxic'。\",\n",
    "        },\n",
    "        {\"role\": \"user\", \"content\": text},\n",
    "    ]\n",
    "    \n",
    "    # 调用聊天模型生成回复\n",
    "    result = openai.chat.completions.create(\n",
    "        messages=messages, model=\"gpt-3.5-turbo\", temperature=0\n",
    "    )\n",
    "    \n",
    "    # 返回模型的回复内容\n",
    "    return result.choices[0].message.content\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4c58435a-b044-4939-9902-21e77ec47d42",
   "metadata": {},
   "source": [
    "## 2.创建或选择评估数据集\n",
    "\n",
    "下面创建一个评估数据集（包含有毒和无毒文本的示例）。该数据集中每个示例都包含三个字典/对象：\n",
    "- outputs：数据集中找到的参考标签或其他上下文\n",
    "- inputs：管道的输入\n",
    "- metadata：示例中存储的任何其他元数据\n",
    "\n",
    "这些字典/对象可以具有任意键（Key）和值（Value），但是所有示例中键必须保持一致。\n",
    "\n",
    "示例中的值也可以采用任何形式，例如字符串、数字、列表或字典，但对于本教程的示例，我们仅使用字符串。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "7b54f22e-17ae-41f2-a137-76e84fef9b49",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langsmith import Client\n",
    "\n",
    "# 创建客户端\n",
    "client = Client()\n",
    "\n",
    "# 创建一个数据集\n",
    "examples = [\n",
    "    (\"Shut up, idiot\", \"Toxic\"),  # 有害\n",
    "    (\"You're a wonderful person\", \"Not toxic\"),  # 无害\n",
    "    (\"This is the worst thing ever\", \"Toxic\"),  # 有害\n",
    "    (\"I had a great day today\", \"Not toxic\"),  # 无害\n",
    "    (\"Nobody likes you\", \"Toxic\"),  # 有害\n",
    "    (\"This is unacceptable. I want to speak to the manager.\", \"Not toxic\"),  # 无害\n",
    "]\n",
    "\n",
    "# 数据集名称\n",
    "dataset_name = \"Toxic Queries\"  \n",
    "dataset = client.create_dataset(dataset_name=dataset_name)\n",
    "\n",
    "# 提取输入和输出\n",
    "inputs, outputs = zip(\n",
    "    *[({\"text\": text}, {\"label\": label}) for text, label in examples]\n",
    ")\n",
    "\n",
    "# 创建示例并将其添加到数据集中\n",
    "client.create_examples(inputs=inputs, outputs=outputs, dataset_id=dataset.id)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "353b3e32-0f28-4de4-8749-03337905385f",
   "metadata": {},
   "source": [
    "## 3.配置评估器\n",
    "\n",
    "创建一个评估器，将模型输出与数据集中的标注对比以进行评分。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "0559ea2a-082d-4836-92cd-7473711ee79a",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langsmith.schemas import Example, Run\n",
    "\n",
    "# 定义函数用于校正标签\n",
    "def correct_label(root_run: Run, example: Example) -> dict:\n",
    "    # 检查 root_run 的输出是否与 example 的输出标签相同\n",
    "    score = root_run.outputs.get(\"output\") == example.outputs.get(\"label\")\n",
    "    # 返回一个包含分数和键的字典\n",
    "    return {\"score\": int(score), \"key\": \"correct_label\"}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8fe3233b-4762-48fe-bc72-3924a5bc03f6",
   "metadata": {},
   "source": [
    "## 4.执行评估查看结果\n",
    "\n",
    "下面使用`evaluate`方法来运行评估，该方法接受以下参数：\n",
    "\n",
    "- 函数（function）：接受输入字典或对象并返回输出字典或对象\n",
    "- 数据（data): 要在其上进行评估的LangSmith数据集的名称或UUID，或者是示例的迭代器\n",
    "- 评估器（evaluators）: 用于对函数输出进行打分的评估器列表\n",
    "- 实验前缀（experiment_prefix）: 用于给实验名称添加前缀的字符串。如果未提供，则将自动生成一个名称。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "eeec0c29-5e85-46e1-915b-619b68627d63",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/ubuntu/miniconda3/envs/langchain/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      "  from .autonotebook import tqdm as notebook_tqdm\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "View the evaluation results for experiment: 'Toxic Queries-465b0ea2' at:\n",
      "https://smith.langchain.com/o/3d35c1a5-b729-4d18-b06d-db0f06a30bc1/datasets/e1df55ff-b66c-4bcf-b5fd-7c63a847136e/compare?selectedSessions=2900c5b7-9dd5-482a-ab79-32888be3d5b9\n",
      "\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "6it [00:01,  4.71it/s]\n"
     ]
    }
   ],
   "source": [
    "from langsmith.evaluation import evaluate\n",
    "\n",
    "# 数据集名称\n",
    "dataset_name = \"Toxic Queries\"\n",
    "\n",
    "# evaluator = StringEvaluator(evaluation_name=\"toxic_judge\", grading_function=correct_label)\n",
    "\n",
    "# 评估函数\n",
    "results = evaluate(\n",
    "    # 使用 label_text 函数处理输入\n",
    "    lambda inputs: label_text(inputs[\"text\"]),\n",
    "    data=dataset_name,  # 数据集名称\n",
    "    evaluators=[correct_label],  # 使用 correct_label 评估函数\n",
    "    experiment_prefix=\"Toxic Queries\",  # 实验前缀名称\n",
    "    description=\"Testing the baseline system.\",  # 可选描述信息\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bfea48d3-c461-4576-9efe-3ae4af6bd084",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3432c5d1-6e8f-4c42-b18c-a566645c4f40",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "e0f8ea90-1b5f-4761-8f2d-9e19d3b61e15",
   "metadata": {},
   "source": [
    "## 使用 LCEL 重写 RAG Bot"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "46817304-1e17-4ca1-a5ba-faebd80c3728",
   "metadata": {},
   "outputs": [],
   "source": [
    "### 索引部分\n",
    "\n",
    "from bs4 import BeautifulSoup as Soup\n",
    "from langchain_community.vectorstores import Chroma\n",
    "from langchain_openai import OpenAIEmbeddings\n",
    "from langchain_community.document_loaders.recursive_url_loader import RecursiveUrlLoader\n",
    "from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
    "\n",
    "# 加载文档\n",
    "url = \"https://python.langchain.com/v0.1/docs/expression_language/\"\n",
    "loader = RecursiveUrlLoader(\n",
    "    url=url, max_depth=20, extractor=lambda x: Soup(x, \"html.parser\").text\n",
    ")\n",
    "docs = loader.load()\n",
    "\n",
    "# 分割文档为小块\n",
    "text_splitter = RecursiveCharacterTextSplitter(chunk_size=4500, chunk_overlap=200)\n",
    "splits = text_splitter.split_documents(docs)\n",
    "\n",
    "# 嵌入并存储在 Chroma 中\n",
    "vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())\n",
    "\n",
    "# 创建检索器\n",
    "retriever = vectorstore.as_retriever()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "096e3129-8e5e-42b9-8c42-d59f072f20c5",
   "metadata": {},
   "outputs": [],
   "source": [
    "### RAG 机器人部分\n",
    "\n",
    "import openai\n",
    "from langsmith import traceable\n",
    "from langsmith.wrappers import wrap_openai\n",
    "\n",
    "class RagBot:\n",
    "\n",
    "    def __init__(self, retriever, model: str = \"gpt-4-0125-preview\"):\n",
    "        self._retriever = retriever\n",
    "        # 包装客户端以监测 LLM\n",
    "        self._client = wrap_openai(openai.Client())\n",
    "        self._model = model\n",
    "\n",
    "    @traceable()\n",
    "    def retrieve_docs(self, question):\n",
    "        # 调用检索器获取相关文档\n",
    "        return self._retriever.invoke(question)\n",
    "\n",
    "    @traceable()\n",
    "    def invoke_llm(self, question, docs):\n",
    "        # 调用 LLM 生成回复\n",
    "        response = self._client.chat.completions.create(\n",
    "            model=self._model,\n",
    "            messages=[\n",
    "                {\n",
    "                    \"role\": \"system\",\n",
    "                    \"content\": \"你是一个乐于助人的 AI 编码助手，擅长 LCEL。使用以下文档生成简明的代码解决方案回答用户的问题。\\n\\n\"\n",
    "                    f\"## 文档\\n\\n{docs}\",\n",
    "                },\n",
    "                {\"role\": \"user\", \"content\": question},\n",
    "            ],\n",
    "        )\n",
    "\n",
    "        # 评估器将期望 \"answer\" 和 \"contexts\"\n",
    "        return {\n",
    "            \"answer\": response.choices[0].message.content,\n",
    "            \"contexts\": [str(doc) for doc in docs],\n",
    "        }\n",
    "\n",
    "    @traceable()\n",
    "    def get_answer(self, question: str):\n",
    "        # 获取答案\n",
    "        docs = self.retrieve_docs(question)\n",
    "        return self.invoke_llm(question, docs)\n",
    "\n",
    "# 创建 RagBot 实例\n",
    "rag_bot = RagBot(retriever)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "431bbdb3-d4a3-445a-9cfc-2e62adff3ad0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\"To build a Retrieval-Augmented Generation (RAG) chain in LCEL, you would need to compose a chain that includes a retriever component to fetch relevant documents or data based on a query, and then pass that retrieved data to a generator model to produce a final output. In LCEL, this would typically involve using `Retriever` and `Generator` components, which you can easily piece together thanks to LCEL's composable nature.\\n\\nThe following example is a simplified step-by-step guide to building a bas\""
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "response = rag_bot.get_answer(\"How to build a RAG chain in LCEL?\")\n",
    "response[\"answer\"][:500]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3a974aa5-7f2e-42f0-bcc4-05ad35cc10d7",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
